# COMPENG 4DM4 Assignment 2 Report

Aaron Pinto (pintoa9), Raeed Hassan (hassam41)

January 12, 2023

# Assumptions

In addition to the assumptions stated in the assignment, the following assumptions are made for all parts of the assignment:

- The FP-ADD unit is a 3-stage pipeline with stages labeled A1, A2, A3.
- The FP-MULT unit is a 6-stage pipeline with stages labeled FM1, FM2, FM3, FM4, FM5, FM6.
- Full forwarding of data is available from the WB stage to any stage.
- Memory is dual-ported, allowing simultaneous reads and/or writes on two ports.
- Two write-backs can be performed per clock cycle.

# Part (a): DAXPY Loop, No Unrolling, with No Scheduling

The timing diagram is shown in Figure 1 and is also submitted as 4DM4-Assignment-#2a-Basic-Timing-Table-Group-47-RH,AP.xlsx. The performance, or MFLOP rating, of the implementation with a 3 GHz clock is $(3 \text{ GHz}) \times (2 \text{ FLOP}/19 \text{ cc}) = 315.8 \text{ MFLOP/s}$.
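
The rating above is the number of floating-point operations per loop iteration divided by the time one iteration takes. A minimal Python sketch of that arithmetic follows; the function name `mflop_rating` and its parameters are illustrative and not part of the assignment, and the sketch assumes every iteration costs the same number of clock cycles.

```python
def mflop_rating(clock_hz: float, flops_per_iter: int, cycles_per_iter: int) -> float:
    """Return the MFLOP/s rating for a loop.

    Assumes every iteration costs the same number of clock cycles
    (steady state), as read off the timing diagram.
    """
    flops_per_second = clock_hz * flops_per_iter / cycles_per_iter
    return flops_per_second / 1e6  # convert FLOP/s to MFLOP/s


# Part (a): one MULT.D and one ADD.D per iteration, 19 cc per iteration.
rating_a = mflop_rating(3e9, 2, 19)
print(f"{rating_a:.1f} MFLOP/s")  # 315.8 MFLOP/s
```

The same helper applies to the other parts by substituting their FLOP and cycle counts.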

Figure 1: Timing Diagram for DAXPY Loop (No Unrolling, No Scheduling)

|     | Instruction |              |    |    |    | Clock Cycle |       |       |      |       |       |       |     |       |      |    |       |     |    |       |    |    | Comment                                         |
|-----|---------|--------------|----|----|----|----------|-------|-------|------|-------|-------|-------|-----|-------|------|----|-------|-----|----|-------|----|----|-------------------------------------------------|
|     |         |              | 1  | 2  | 3  | 4        | 5     | 6     | 7    | 8     | 9     | 10    | 11  | 12    | 13   | 14 | 15    | 16  | 17 | 18    | 19 | 20 |                                                 |
| loop: | L.D     | F2, 0(R1)    | F1 | F2 | ID | EX       | M1    | M2*   | WB   |       |       |       |     |       |      |    |       |     |    |       |    |    | forward F2 (from M2* to *FM1) in cc6            |
|     | MULT.D  | F4, F2, F0   |    | F1 | F2 | ID       | stall | stall | *FM1 | FM2   | FM3   | FM4   | FM5 | FM6** | WB   |    |       |     |    |       |    |    | forward F4 (FM6** to **A1) in cc12              |
|     | L.D     | F6, 0(R2)    |    |    | F1 | F2       | stall | stall | ID   | EX    | M1    | M2    | WB  |       |      |    |       |     |    |       |    |    |                                                 |
|     | ADD.D   | F6, F4, F6   |    |    |    | F1       | stall | stall | F2   | stall | stall | stall | ID  | stall | **A1 | A2 | A3*   | WB  |    |       |    |    | forward F6 (A3* to *M1) in cc15                 |
|     | S.D     | 0(R2), F6    |    |    |    |          | stall | stall | F1   | stall | stall | stall | F2  | stall | ID   | EX | stall | *M1 | M2 | WB    |    |    |                                                 |
|     | DADDUI  | R1, R1, #8   |    |    |    |          |       |       |      | stall | stall | stall | F1  | stall | F2   | ID | stall | EX  | WB |       |    |    | no need to forward R1 as WB and ID both on cc17 |
|     | DADDUI  | R2, R2, #8   |    |    |    |          |       |       |      |       |       |       |     |       | F1   | F2 | stall | ID  | EX | WB    |    |    |                                                 |
|     | DSGTUI  | R3, R1, done |    |    |    |          |       |       |      |       |       |       |     |       |      | F1 | stall | F2  | ID | EX    | WB |    |                                                 |
|     | BEQZ    | R3, loop     |    |    |    |          |       |       |      |       |       |       |     |       |      |    |       | F1  | F2 | stall | ID |    |                                                 |
|     | NO-OP   |              |    |    |    |          |       |       |      |       |       |       |     |       |      |    |       |     | F1 | stall | F2 |    |                                                 |
|     | NO-OP   |              |    |    |    |          |       |       |      |       |       |       |     |       |      |    |       |     |    | stall | F1 | F2 |                                                 |

# Part (b): DAXPY Loop, No Unrolling, with Scheduling

The timing diagram is shown in Figure 2 and is also included in the submission as 4DM4-Assignment-#2b-Basic-Timing-Table-Group-47-RH,AP.xlsx. The performance, or MFLOP rating, of the implementation with a 3 GHz clock is $(3 \text{ GHz}) \times (2 \text{ FLOP}/13 \text{ cc}) = 461.5 \text{ MFLOP/s}$.

Figure 2: Timing Diagram for DAXPY Loop (No Unrolling, Scheduling)

| Instruction |              |    |    |    | Clock Cycle |     |       |      |     |     |       |       |       |      |    |    |    |    |    | Comment                              |
|-----------|--------------|----|----|----|-----------|-----|-------|------|-----|-----|-------|-------|-------|------|----|----|----|----|----|--------------------------------------|
|           |              | 1  | 2  | 3  | 4         | 5   | 6     | 7    | 8   | 9   | 10    | 11    | 12    | 13   | 14 | 15 | 16 | 17 | 18 |                                      |
| loop: L.D | F2, 0(R1)    | F1 | F2 | ID | EX        | M1  | M2*   | WB   |     |     |       |       |       |      |    |    |    |    |    | forward F2 (from M2* to *FM1) in cc6 |
| L.D       | F6, 0(R2)    |    | F1 | F2 | ID        | EX  | M1    | M2   | WB  |     |       |       |       |      |    |    |    |    |    |                                      |
| MULT.D    | F4, F2, F0   |    |    | F1 | F2        | ID  | stall | *FM1 | FM2 | FM3 | FM4   | FM5   | FM6** | WB   |    |    |    |    |    | forward F4 (FM6** to **A1) in cc12   |
| DADDUI    | R1, R1, #8   |    |    |    | F1        | F2  | stall | ID   | EX* | WB  |       |       |       |      |    |    |    |    |    | forward R1 (EX* to *EX) in cc8       |
| DSGTUI    | R3, R1, done |    |    |    |           | F1  | stall | F2   | ID  | *EX | WB    |       |       |      |    |    |    |    |    |                                      |
| ADD.D     | F6, F4, F6   |    |    |    |           |     |       | F1   | F2  | ID  | stall | stall | stall | **A1 | A2 | A3 | WB |    |    |                                      |
| BEQZ      | R3, loop     |    |    |    |           |     |       |      | F1  | F2  | stall | stall | stall | ID   |    |    |    |    |    |                                      |
| S.D       | 0(R2), F6    |    |    |    |           |     |       |      |     | F1  | stall | stall | stall | F2   | ID | EX | M1 | M2 | WB |                                      |
| DADDUI    | R2, R2, #8   |    |    |    |           |     |       |      |     |     |       |       |       | F1   | F2 | ID | EX | WB |    |                                      |

# Part (c): DAXPY Loop, With Unrolling, with no Scheduling

The compressed timing diagram is shown in Table 1 and is also included in the submission as 4DM4-Assignment-#2c-Compressed-Timing-Table-single-issue-Group-47-RH,AP.xlsx. The performance, or MFLOP rating, of the implementation with a 3 GHz clock is $(3 \text{ GHz}) \times (8 \text{ FLOP}/61 \text{ cc}) = 393.4 \text{ MFLOP/s}$.
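
The 8 FLOP figure comes from replicating the loop body four times, so each pass through the unrolled loop performs four multiplies and four adds. The Python sketch below illustrates that transformation on the DAXPY computation `y[i] = a*x[i] + y[i]`. The function names are hypothetical, and the sketch assumes the vector length is a multiple of 4 because it has no clean-up loop.

```python
def daxpy(a, x, y):
    """Reference loop: one multiply and one add (2 FLOP) per element."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_unrolled4(a, x, y):
    """Same computation with the body replicated 4 times (8 FLOP per pass).

    Assumes len(x) is a multiple of 4; there is no clean-up loop.
    """
    assert len(x) % 4 == 0, "sketch assumes a multiple of 4 elements"
    for i in range(0, len(x), 4):
        y[i] = a * x[i] + y[i]
        y[i + 1] = a * x[i + 1] + y[i + 1]
        y[i + 2] = a * x[i + 2] + y[i + 2]
        y[i + 3] = a * x[i + 3] + y[i + 3]
    return y


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y0 = [0.5] * 8
print(daxpy_unrolled4(2.0, x, list(y0)) == daxpy(2.0, x, list(y0)))  # True
```

The sketch folds the index offsets into each statement. The unscheduled code in Table 1 instead keeps both DADDUI pointer increments after every copy of the body, so only the compare and branch are amortized over the four copies.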

Table 1: Compressed Timing Diagram for DAXPY Loop (Unrolling, No Scheduling)

| Instruction Slot #1 | IF (F1,F2) | ID  | EX (Int, FP)         | MEM (M1,M2) | WB | Comment/Hazard                                                                              |
|---------------------|------------|-----|----------------------|-------------|----|---------------------------------------------------------------------------------------------|
| loop: L.D F2, 0(R1) | 1,2        | 3   | 4                    | 5,6*        | 7  | forward F2 (from M2* to *FM1) in cc6                                                        |
| MULT.D F4, F2, F0   | 2,3        | 4   | *7,8,9,10,11,12**    |             | 13 | forward F4 (FM6** to **A1) in cc12<br>stall on FM1 waiting for F2                           |
| L.D F6, 0(R2)       | 3,4        | 7   | 8                    | 9,10        | 11 |                                                                                             |
| ADD.D F6, F4, F6    | 4,7        | 11  | **13,14,15*          |             | 16 | forward F6 (A3* to *M1) in cc15<br>stall on ID waiting for F6<br>stall on A1 waiting for F4 |
| S.D 0(R2), F6       | 7,11       | 13  | 14                   | *16,17      | 18 | stall on M1 waiting for F6                                                                  |
| DADDUI R1, R1, #8   | 11,13      | 14  | 16                   |             | 17 |                                                                                             |
| DADDUI R2, R2, #8   | 13,14      | 16  | 17                   |             | 18 |                                                                                             |
| L.D F2, 0(R1)       | 14,16      | 17  | 18                   | 19,20*      | 21 | forward F2 (from M2* to *FM1) in cc20                                                       |
| MULT.D F4, F2, F0   | 16,17      | 18  | *21,22,23,24,25,26** |             | 27 | forward F4 (FM6** to **A1) in cc26<br>stall on FM1 waiting for F2                           |
| L.D F6, 0(R2)       | 17,18      | 21  | 22                   | 23,24       | 25 |                                                                                             |
| ADD.D F6, F4, F6    | 18,21      | 25  | **27,28,29*          |             | 30 | forward F6 (A3* to *M1) in cc29<br>stall on ID waiting for F6<br>stall on A1 waiting for F4 |
| S.D 0(R2), F6       | 21,25      | 27  | 28                   | *30,31      | 32 | stall on M1 waiting for F6                                                                  |
| DADDUI R1, R1, #8   | 25,27      | 28  | 30                   |             | 31 |                                                                                             |
| DADDUI R2, R2, #8   | 27,28      | 30  | 31                   |             | 32 |                                                                                             |
| L.D F2, 0(R1)       | 28,30      | 31  | 32                   | 33,34*      | 35 | forward F2 (from M2* to *FM1) in cc34                                                       |
| MULT.D F4, F2, F0   | 30,31      | 32  | *35,36,37,38,39,40** |             | 41 | forward F4 (FM6** to **A1) in cc40<br>stall on FM1 waiting for F2                           |
| L.D F6, 0(R2)       | 31,32      | 35  | 36                   | 37,38       | 39 |                                                                                             |
| ADD.D F6, F4, F6    | 32,35      | 39  | **41,42,43*          |             | 44 | forward F6 (A3* to *M1) in cc43<br>stall on ID waiting for F6<br>stall on A1 waiting for F4 |
| S.D 0(R2), F6       | 35,39      | 41  | 42                   | *44,45      | 46 | stall on M1 waiting for F6                                                                  |
| DADDUI R1, R1, #8   | 39,41      | 42  | 44                   |             | 45 |                                                                                             |
| DADDUI R2, R2, #8   | 41,42      | 44  | 45                   |             | 46 |                                                                                             |
| L.D F2, 0(R1)       | 42,44      | 45  | 46                   | 47,48*      | 49 | forward F2 (from M2* to *FM1) in cc48                                                       |
| MULT.D F4, F2, F0   | 44,45      | 46  | *49,50,51,52,53,54** |             | 55 | forward F4 (FM6** to **A1) in cc54<br>stall on FM1 waiting for F2                           |
| L.D F6, 0(R2)       | 45,46      | 49  | 50                   | 51,52       | 53 |                                                                                             |
| ADD.D F6, F4, F6    | 46,49      | 53  | **55,56,57*          |             | 58 | forward F6 (A3* to *M1) in cc57<br>stall on ID waiting for F6<br>stall on A1 waiting for F4 |
| S.D 0(R2), F6       | 49,53      | 55  | 56                   | *58,59      | 60 | stall on M1 waiting for F6                                                                  |
| DADDUI R1, R1, #8   | 53,55      | 56  | 58                   |             | 59 |                                                                                             |
| DADDUI R2, R2, #8   | 55,56      | 58  | 59                   |             | 60 |                                                                                             |
| DSGTUI R3, R1, done | 56,58      | 59  | 60*                  |             | 61 | forward R3 (EX* to *ID) in cc60                                                             |
| BEQZ R3, loop       | 58,59      | *61 |                      |             |    | stall on ID waiting for R3                                                                  |
| NO-OP               | 59,61      |     |                      |             |    | branch-delay slot                                                                           |
| NO-OP               | 61,62      |     |                      |             |    | branch-delay slot                                                                           |
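
Each row of the compressed table lists only the cycles in which an instruction occupies a stage, so stall cycles are implied by gaps between consecutive entries. The Python sketch below expands one compressed row back into the per-cycle form used in Figure 1. The function `expand_row` and its stage-to-cycle dictionary format are illustrative assumptions, not part of the submitted spreadsheets.

```python
def expand_row(stages):
    """Expand {stage_name: cycle} into {cycle: stage_name or 'stall'}.

    Any cycle between the first and last stage that is not occupied
    by a stage is treated as a stall.
    """
    by_cycle = {cycle: name for name, cycle in stages.items()}
    first, last = min(by_cycle), max(by_cycle)
    return {cc: by_cycle.get(cc, "stall") for cc in range(first, last + 1)}


# First MULT.D F4, F2, F0 row of Table 1: IF 2,3; ID 4; EX 7-12; WB 13.
mult_d = {"F1": 2, "F2": 3, "ID": 4,
          "FM1": 7, "FM2": 8, "FM3": 9, "FM4": 10, "FM5": 11, "FM6": 12,
          "WB": 13}
row = expand_row(mult_d)
stalls = [cc for cc, stage in row.items() if stage == "stall"]
print(stalls)  # [5, 6]
```

The two stall cycles recovered here match the MULT.D row of Figure 1, where the instruction waits in cc5 and cc6 for F2.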

# Part (d): DAXPY Loop, With Unrolling, and with Scheduling

The compressed timing diagram is shown in Table 2 and is also included in the submission as 4DM4-Assignment-#2d-Compressed-Timing-Table-single-issue-Group-47-RH,AP.xlsx. The performance, or MFLOP rating, of the implementation with a 3 GHz clock is $(3 \text{ GHz}) \times (8 \text{ FLOP}/24 \text{ cc}) = 1000 \text{ MFLOP/s}$.

Table 2: Compressed Timing Diagram for DAXPY Loop (Unrolling, Scheduling)

| Instruction Slot #1 | IF (F1,F2) | ID | EX (Int, FP)       | MEM (M1,M2) | WB | Comment/Hazard                         |
|---------------------|------------|----|--------------------|-------------|----|----------------------------------------|
| loop: L.D F1, 0(R1) | 1,2        | 3  | 4                  | 5,6         | 7  |                                        |
| L.D F4, 8(R1)       | 2,3        | 4  | 5                  | 6,7         | 8  |                                        |
| L.D F7, 16(R1)      | 3,4        | 5  | 6                  | 7,8         | 9  |                                        |
| L.D F10, 24(R1)     | 4,5        | 6  | 7                  | 8,9         | 10 |                                        |
| L.D F3, 0(R2)       | 5,6        | 7  | 8                  | 9,10        | 11 |                                        |
| L.D F6, 8(R2)       | 6,7        | 8  | 9                  | 10,11       | 12 |                                        |
| L.D F9, 16(R2)      | 7,8        | 9  | 10                 | 11,12       | 13 |                                        |
| L.D F12, 24(R2)     | 8,9        | 10 | 11                 | 12,13       | 14 |                                        |
| MULT.D F2, F1, F0   | 9,10       | 11 | 12,13,14,15,16,17* |             | 18 | forward F2 (from FM6* to *A1) in cc17  |
| MULT.D F5, F4, F0   | 10,11      | 12 | 13,14,15,16,17,18* |             | 19 | forward F5 (from FM6* to *A1) in cc18  |
| MULT.D F8, F7, F0   | 11,12      | 13 | 14,15,16,17,18,19* |             | 20 | forward F8 (from FM6* to *A1) in cc19  |
| MULT.D F11, F10, F0 | 12,13      | 14 | 15,16,17,18,19,20* |             | 21 | forward F11 (from FM6* to *A1) in cc20 |
| DADDUI R1, R1, #32  | 13,14      | 15 | 16*                |             | 17 | forward R1 (from EX* to *EX) in cc16   |
| DSGTUI R3, R1, done | 14,15      | 16 | *17                |             | 18 |                                        |
| ADD.D F3, F2, F3    | 15,16      | 17 | *18,19,20          |             | 21 |                                        |
| ADD.D F6, F5, F6    | 16,17      | 18 | *19,20,21          |             | 22 |                                        |
| ADD.D F9, F8, F9    | 17,18      | 19 | *20,21,22          |             | 23 |                                        |
| ADD.D F12, F11, F12 | 18,19      | 20 | *21,22,23          |             | 24 |                                        |
| S.D 0(R2), F3       | 19,20      | 21 | 22                 | 23,24       | 25 |                                        |
| S.D 8(R2), F6       | 20,21      | 22 | 23                 | 24,25       | 26 |                                        |
| S.D 16(R2), F9      | 21,22      | 23 | 24                 | 25,26       | 27 |                                        |
| BEQZ R3, loop       | 22,23      | 24 |                    |             |    |                                        |
| S.D 24(R2), F12     | 23,24      | 25 | 26                 | 27,28       | 29 |                                        |
| DADDUI R2, R2, #32  | 24,25      | 26 | 27                 |             | 28 |                                        |

# Part (e): DAXPY Loop, With Unrolling and Scheduling, on Dual-Issue Machine

The compressed timing diagram is shown in Table 3 and is also included in the submission as 4DM4-Assignment-#2e-Compressed-Timing-Table-dual-issue-Group-47-RH,AP.xlsx. The performance, or MFLOP rating, of the implementation with a 3 GHz clock is $(3 \text{ GHz}) \times (8 \text{ FLOP}/16 \text{ cc}) = 1500 \text{ MFLOP/s}$.

Table 3: Compressed Timing Diagram for DAXPY Loop (Unrolling, Scheduling, Dual Issue)

| Instruction Slot #1 | Instruction Slot #2 | IF    | ID | EX1 | MEM1   | WB1 | EX2      | MEM2 | WB2 | Comment/Hazard                      |
|---------------------|---------------------|-------|----|-----|--------|-----|----------|------|-----|-------------------------------------|
| loop: L.D F1, 0(R1) | NO-OP               | 1,2   | 3  | 4   | 5,6*   | 7   | -        | -    | -   | forward F1 (M2* to *FM1) in cc6     |
| L.D F4, 8(R1)       | NO-OP               | 2,3   | 4  | 5   | 6,7*   | 8   | -        | -    | -   | forward F4 (M2* to *FM1) in cc7     |
| L.D F7, 16(R1)      | NO-OP               | 3,4   | 5  | 6   | 7,8*   | 9   | -        | -    | -   | forward F7 (M2* to *FM1) in cc8     |
| L.D F10, 24(R1)     | MULT.D F2, F1, F0   | 4,5   | 6  | 7   | 8,9*   | 10  | *7-12**  |      | 13  | forward F10 (M2* to *FM1) in cc9<br>forward F2 (FM6** to **A1) in cc12 |
| L.D F3, 0(R2)       | MULT.D F5, F4, F0   | 5,6   | 7  | 8   | 9,10   | 11  | *8-13**  |      | 14  | forward F5 (FM6** to **A1) in cc13  |
| L.D F6, 8(R2)       | MULT.D F8, F7, F0   | 6,7   | 8  | 9   | 10,11  | 12  | *9-14**  |      | 15  | forward F8 (FM6** to **A1) in cc14  |
| L.D F9, 16(R2)      | MULT.D F11, F10, F0 | 7,8   | 9  | 10  | 11,12  | 13  | *10-15** |      | 16  | forward F11 (FM6** to **A1) in cc15 |
| L.D F12, 24(R2)     | DADDUI R1, R1, #32  | 8,9   | 10 | 11  | 12,13  | 14  | 11*      |      | 11  | forward R1 (EX* to *EX) in cc11     |
| DADDUI R2, R2, #32  | DSGTUI R3, R1, done | 9,10  | 11 | 12  |        | 13  | *12      |      | 12  |                                     |
| NO-OP               | ADD.D F3, F2, F3    | 10,11 | 12 | -   | -      | -   | **13-15* |      | 16  | forward F3 (A3* to *M1) in cc15     |
| NO-OP               | ADD.D F6, F5, F6    | 11,12 | 13 | -   | -      | -   | **14-16* |      | 17  | forward F6 (A3* to *M1) in cc16     |
| NO-OP               | ADD.D F9, F8, F9    | 12,13 | 14 | 15  | *16,17 | 18  | **15-17  |      | 18* | forward F9 (WB* to *M1) in cc18     |
| S.D -32(R2), F3     | ADD.D F12, F11, F12 | 13,14 | 15 | 16  | *17,18 | 19  | **16-18  |      | 19* | forward F12 (WB* to *M1) in cc19    |
| S.D -24(R2), F6     | BEQZ R3, loop       | 14,15 | 16 |     |        |     | -        | -    | -   |                                     |
| S.D -16(R2), F9     | NO-OP               | 15,16 | 17 | 18  | *19,20 | 21  | -        | -    | -   |                                     |
| S.D -8(R2), F12     | NO-OP               | 16,17 | 18 | 19  | *20,21 | 22  | -        | -    | -   |                                     |
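
The five implementations can be compared on a per-element basis, since each vector element costs 2 FLOP. The Python sketch below tabulates clock cycles per element and the speedup relative to Part (a), using only the FLOP and cycle counts quoted in each part. The variable and function names are illustrative.

```python
# (FLOP per loop pass, clock cycles per loop pass) as quoted in each part.
results = {
    "a": (2, 19),
    "b": (2, 13),
    "c": (8, 61),
    "d": (8, 24),
    "e": (8, 16),
}

FLOP_PER_ELEMENT = 2  # one MULT.D and one ADD.D per vector element


def cycles_per_element(flop, cycles):
    """Clock cycles spent per vector element for one loop pass."""
    return cycles / (flop / FLOP_PER_ELEMENT)


baseline = cycles_per_element(*results["a"])
for part, (flop, cycles) in results.items():
    cpe = cycles_per_element(flop, cycles)
    print(f"Part ({part}): {cpe:5.2f} cc/element, speedup {baseline / cpe:.2f}x")
```

Under this accounting, the dual-issue schedule of Part (e) needs 4 cc per element, compared with 19 cc per element in Part (a).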